High Availability

RHQ supports running multiple RHQ Servers to provide a High Availability (HA) environment. We call this set of RHQ Servers the "RHQ HA Server cloud". A multi-server HA environment provides RHQ Agent failover and distribution of load.

HA is integrated into the standard installation process. There is no separate HA installer. The important points to understand are:

Multiple RHQ Servers can be configured to talk to the same database. In fact, this is required: every RHQ Server in the cloud must talk to the same database backend.
Each RHQ Server requires a unique name for identification.
Each RHQ Server requires a publicly accessible endpoint (public to the population of RHQ Agents).

Installing or upgrading a single server will result in a single-server environment (a "1-server cloud"). To add more servers to the HA server cloud, simply install RHQ Servers on the target machines, each time configuring for the same database, and selecting unique names and endpoint information for each. RHQ Servers can be added and removed from the cloud at any point in time.

To learn more about whether an HA environment is appropriate for your needs, and how to move to HA from an existing RHQ installation, continue on and read High Availability Configuration.

High Availability does not necessarily equate to larger scale. Increasing the number of RHQ Servers in your HA server cloud will allow you to handle more RHQ Agents, but eventually you will need to scale your backend database as well.

High Availability Configuration

The goal of HA is to support multiple RHQ servers configured against a single database repository. RHQ Agent load can then be partitioned amongst the available RHQ Servers. Failover will occur for agents whose server becomes unreachable. A multi-server HA configuration provides fault tolerance and improved scalability.

Details of the HA Design can be found at Design-High Availability - Agent Failover.

You can watch a demo that illustrates how to install a new Server into an HA Server cloud. That demo also shows agent failover in action. You can also watch an Affinity Groups demo to see how Affinity Groups are used to group your Servers and Agents together. For these demos and more, go to Demos.

When You Should Set Up High Availability

In many circumstances, it may be satisfactory to run a single-server configuration. But if your environment satisfies one or more of the following criteria you may want to consider a multi-server approach:

Agent report processing is slow or not able to keep up:
- Metrics
- Alert generation
- Event generation
- Resource availability
You have a geographically distributed environment:
- Multiple Data Centers
- Logical grouping of agents to servers
You have high agent load:
- 100+ Agents (this number can vary widely depending on your RHQ Server and database hardware)
- RHQ Server is having trouble processing the agent load (this is a better indicator)

HA Infrastructure

In reality, every RHQ environment is an "HA configuration". You can consider an environment with a single RHQ server as a 1-Server HA Cloud because it can still be managed via the HA Administration GUI pages.

We will call the set of HA Administration GUI pages the "HA Administration Console" or "HAAC" for short.

For example, the HAAC pages Administration>Servers, Administration>Agents, etc. are all applicable to a single-server environment and accessible from the RHQ GUI. Since a single-server environment is a 1-Server cloud, it easily adapts to an increase in cloud size.

In general, RHQ Servers can be added to or removed from the HA Cloud at any time. So, a single-server environment can be turned into a multi-server environment by installing an RHQ Server on a second machine, configuring it for the same database, and defining the new server with a unique server name (aka rhq.server.high-availability-name) and a public endpoint.

Database Impact

Although RHQ Servers can be added to the HA Server Cloud with relative ease, it should be done cautiously due to the potential impact on the back-end database. Each RHQ Server limits its concurrent database connections, but there is no restriction on the Cloud as a whole. This means that adding a second server effectively doubles the potential database connections, even if the number of RHQ Agents remains the same. The increase is linear as servers are added.

Each RHQ Server instance has built-in mechanisms for limiting the load it puts on the database. In the current RHQ release, the out-of-the-box limit is 55 simultaneous connections. A server may use fewer connections (the number depends largely on how many agents are connected to it and how much data it needs to process concurrently), but the limit guarantees that it never uses more than 55 connections to the database at any given point in time. So, for example, a 2-Server configuration requires that the database be able to handle 110 connections.
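The linear growth described above can be worked through with a small sizing helper. This is only a sketch for planning purposes: the function and constant names are hypothetical and not part of RHQ, and the figure of 55 is the default per-server limit quoted in the text, which may differ in your release or configuration.

```python
# Illustrative sizing helper. The names here are hypothetical, not RHQ APIs.
DEFAULT_MAX_CONNECTIONS_PER_SERVER = 55  # out-of-the-box limit cited in the text


def required_db_connections(server_count,
                            per_server_limit=DEFAULT_MAX_CONNECTIONS_PER_SERVER):
    """Return the worst-case number of simultaneous database connections.

    The cloud itself imposes no cap, so the potential total is simply the
    per-server limit multiplied by the number of servers.
    """
    if server_count < 1:
        raise ValueError("an RHQ environment has at least one server")
    return server_count * per_server_limit


# Print the potential connection count for a few cloud sizes.
for n in (1, 2, 4):
    print(f"{n} server(s): up to {required_db_connections(n)} connections")
```

The result is an upper bound to discuss with your DBA, not a prediction of actual usage; a lightly loaded cloud will normally stay well below it.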

The RHQ Administrator should work closely with the Database Administrator to ensure an adequate configuration. In general, a large scale RHQ configuration requires DBA planning to handle not only connections, but to provide a database with reasonable data distribution and space allocation. HA impact is just another aspect to take into consideration.

Note that an HA configuration does not necessarily imply a large number of RHQ Agents. A relatively small RHQ implementation, with only a few RHQ Agents, may still need high availability and therefore require failover servers. In that case the backing database will still have a high number of potential connections, but in reality will not reach that limit.

Server and Agent Endpoints

In a multi-server HA configuration, it is important to realize that any agent could potentially try to connect to any server. Thus, it is critical that every RHQ Agent be able to resolve the Endpoint Address set for every RHQ Server in the HA Server cloud. So, when defining the RHQ Server in the installer, it is important that the Endpoint Address be public to the degree that the RHQ Agent population can resolve the RHQ Server's address and be able to reach the RHQ Server via the defined address and port (or secure port, if configured for secure communications).

Note that the RHQ Server endpoint information can be updated via the HAAC in the RHQ GUI.

Conversely, an RHQ Agent connecting to an RHQ Server must provide an endpoint reachable by all the RHQ Servers in the Cloud in order to allow for the necessary two-way communications.
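Before adding a server or agent, it can help to verify the two properties this section asks for: that an endpoint address resolves, and that its port accepts connections. The following is a minimal sketch using only the Python standard library; it is not an RHQ tool, and the function names are hypothetical. Run it from an agent machine against each server endpoint, or from a server machine against agent endpoints.

```python
import socket


def check_endpoint(host, port, timeout=3.0):
    """Return (resolvable, reachable) for a host/port pair."""
    try:
        # Step 1: can this machine resolve the endpoint address at all?
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return (False, False)
    try:
        # Step 2: can a TCP connection be opened to the endpoint port?
        with socket.create_connection((host, port), timeout=timeout):
            return (True, True)
    except OSError:
        return (True, False)


def unreachable_endpoints(endpoints, timeout=3.0):
    """Return (host, port, reason) for every endpoint that fails a check."""
    problems = []
    for host, port in endpoints:
        resolvable, reachable = check_endpoint(host, port, timeout)
        if not (resolvable and reachable):
            reason = "unresolvable" if not resolvable else "unreachable"
            problems.append((host, port, reason))
    return problems
```

For example, calling unreachable_endpoints([("rhq-server-1.example.com", 7080)]) would report whether that endpoint is usable from the current machine. The host name and port are placeholders; substitute the endpoint addresses and ports shown for your servers in the HAAC. A successful TCP connection only shows network reachability; it does not prove that RHQ communication will succeed.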

Failover Lists

Each agent will be assigned a "failover list". A failover list helps the agent determine which server it should communicate with. A failover list has one or more server public endpoints in it.

Each server has a public endpoint address associated with it (which can be either a hostname or IP address). Those server public endpoints are used as failover list entries. Therefore, it is very important that all servers are assigned public endpoint addresses that are resolvable by all agents and that all agents have connectivity to those addresses. You can view all your server public endpoint addresses in the HAAC's server list view.

A failover list is ordered - the first server in the failover list is considered the agent's "primary server". The primary server is the server that the agent should try to communicate with first. If, for some reason, the agent cannot talk to the primary server, the agent will move down the failover list, trying to talk to each server in the order they appear in the failover list. For example, if the first server (the primary server) is down, the agent will attempt to communicate with server #2 in the list. If the agent can't talk to that server, the agent will continue down the list (trying server #3 next) until it successfully communicates with a server.

If the agent exhausts its entire failover list and still cannot communicate with any server, the agent enters a mode where it temporarily stops trying to send messages over the wire and will start to spool the messages to disk so it can retry them later. When the agent discovers that a server has come back online (it usually does this by periodically polling all servers in its failover list until it finds one that it can talk to), the agent will send messages it previously spooled to disk to that online server and will continue to talk to that server normally.

An agent will try to ensure that it is connected to its primary server. Every hour (which is the default setting) the agent will check to see if the server it is currently talking to is its primary server. If it is not (for example, if the agent had recently failed over to one of its secondary servers due to its primary server going down), the agent will attempt to re-connect with its primary server. This helps maintain desired affinity and keeps the server/agent HA infrastructure in its most efficient configuration.
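The failover behavior described above (walk the ordered list, spool when nothing is reachable, flush the spool when a server returns, and periodically switch back to the primary) can be modeled in a few lines. This is a simplified illustration, not the agent's actual implementation: the class and method names are hypothetical, the spool is kept in memory rather than on disk, and the can_reach callable stands in for a real connection attempt.

```python
from collections import deque


class AgentSketch:
    """Toy model of an agent's failover handling (hypothetical names)."""

    def __init__(self, failover_list, can_reach):
        self.failover_list = list(failover_list)  # ordered; index 0 is the primary
        self.can_reach = can_reach                # callable(server) -> bool
        self.current = None                       # server currently in use
        self.spool = deque()                      # messages awaiting a server
        self.delivered = []                       # (server, message) pairs

    def find_server(self):
        """Return the first reachable server in failover-list order, or None."""
        for server in self.failover_list:
            if self.can_reach(server):
                return server
        return None

    def send(self, message):
        """Deliver a message, failing over or spooling as needed."""
        if self.current is None or not self.can_reach(self.current):
            self.current = self.find_server()
        if self.current is None:
            self.spool.append(message)  # list exhausted: keep for later
            return False
        while self.spool:               # a server is back: flush oldest first
            self.delivered.append((self.current, self.spool.popleft()))
        self.delivered.append((self.current, message))
        return True

    def check_primary(self):
        """Periodic check (hourly by default in the text): prefer the primary."""
        primary = self.failover_list[0]
        if self.current != primary and self.can_reach(primary):
            self.current = primary


up = {"server-a": False, "server-b": True, "server-c": True}
agent = AgentSketch(["server-a", "server-b", "server-c"], lambda s: up[s])
agent.send("metric-1")                   # primary is down: fails over to server-b
up["server-b"] = up["server-c"] = False
agent.send("metric-2")                   # nothing reachable: message is spooled
up["server-a"] = True
agent.send("metric-3")                   # server-a is back: spool flushed first
print(agent.delivered)
```

In this run the first message goes to server-b, and the second and third are both delivered to server-a once it returns, with the spooled message sent first.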

All failover lists for all agents are generated by the server. An agent obtains its failover list when it registers with the server and periodically thereafter (by default, the agent will check every hour to see if it has been assigned a new failover list). A failover list can change when new servers and new agents are added to the environment and when affinity is changed (i.e. when an agent is added to or removed from an affinity group).

Registration Server

When you initially set up the agent and assign it an initial server hostname/IP address, you are merely assigning the agent its temporary "registration server" - this server will not necessarily be considered its primary server. Once the agent initially communicates with your "registration server", the agent will get its failover list and will immediately attempt to switch over to its primary server. If the primary server is actually the same as the registration server, then nothing extra needs to happen and the agent continues on knowing that it is talking to its primary server. However, if the primary server is different from the registration server, the agent will immediately turn away from the registration server and start to communicate with the primary server. If the primary server is down or otherwise cannot be connected to, the agent will start the normal failover mechanism, stepping through its failover list looking for a server to talk to. Once the agent gets a failover list, it will use that to determine which servers to talk to (i.e. the registration server you assigned the agent at startup will no longer dictate which server the agent will talk to).

You can examine an agent's failover list in one of several ways:

From the GUI, go to the HAAC's view for the individual agent to see the ordered list of servers in the agent's failover list.
Execute the agent prompt command "failover --list".
Look at the content of the agent's data/failover-list.dat file.

Affinity

By default, agent load is distributed evenly amongst the servers in the cloud. Balance can change in failover situations but in general, by default, agent load will be evenly distributed when all agents and all servers are running.

This is fine when it is unimportant which RHQ Agents connect to which RHQ Servers. But there are use cases where it may be desirable to create stronger bonds between specific agents and servers. This is accomplished by defining RHQ HA Affinity Groups. RHQ Agents will prefer connecting to RHQ Servers in the same Affinity Group. Affinity Group assignment is optional and any given RHQ Agent or RHQ Server can participate in at most one Affinity Group.

Affinity Behavior

Affinity is described in more detail in the HA Design Document, but the following basics should be understood about Affinity Group behavior:

Affinity is strong. RHQ will satisfy Affinity Group preference and then apply load balancing.
An RHQ Agent will fail over to an available RHQ Server in its affinity group before it fails over to a non-affinity RHQ Server.
Affinity is not guaranteed. Although strong, affinity can be broken if no affinity server is available. What this means is that all RHQ Servers will be present in an RHQ Agent's failover list. However, those servers with affinity to the agent come first (i.e. the agent will try to failover to those servers first).
RHQ attempts to distribute load evenly within the Affinity Group.
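The ordering rules above can be illustrated with a simplified model of how a failover list might be arranged: affinity servers first, then all remaining servers, with the least loaded server first inside each part. This is only a sketch under those assumptions. The real algorithm is described in the HA Design Document, and the function name, data shapes, and server names below are hypothetical.

```python
def build_failover_list(agent_group, servers, load):
    """Order servers for one agent (simplified, hypothetical model).

    agent_group: the agent's affinity group name, or None
    servers:     dict of server name -> affinity group name (or None)
    load:        dict of server name -> number of agents currently assigned
    """
    def by_load(names):
        # Least loaded first; the name breaks ties so the result is stable.
        return sorted(names, key=lambda name: (load.get(name, 0), name))

    if agent_group is None:
        # No affinity: every server is ordered by load alone.
        return by_load(servers)

    preferred = [s for s, group in servers.items() if group == agent_group]
    others = [s for s, group in servers.items() if group != agent_group]
    # Affinity is satisfied first, then load balancing is applied. Every
    # server still appears, so affinity can be broken as a last resort.
    return by_load(preferred) + by_load(others)


servers = {"nyc-1": "nyc", "nyc-2": "nyc", "lon-1": "lon", "spare": None}
load = {"nyc-1": 12, "nyc-2": 7, "lon-1": 3, "spare": 0}

print(build_failover_list("nyc", servers, load))
print(build_failover_list(None, servers, load))
```

For the agent in the "nyc" group, both nyc servers come first even though lon-1 and spare carry less load, which is what "affinity is strong" means in practice. The agent without a group sees all servers ordered by load only.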

When To Use Affinity

Following are scenarios that may benefit from Affinity Group assignment.

Physical Efficiency

In general, if it is clear that certain agent-server connections will run more efficiently than others, then defining affinity to prefer those connections makes sense. This could include RHQ Servers and RHQ Agents co-located in the same data center, other geographic grouping, or various network topology scenarios.

Logical Efficiency

Even when no agent-server pairing runs more efficiently than another, there may be other reasons to group agents and servers together. For example, organizational concerns such as administration responsibilities and business units are logical reasons to use affinity grouping.

Warm Backup

It may be the case that certain machines should not be assigned agent load unless specifically needed for failover purposes. In this case you would have all agents assigned affinity to a subset of the available servers, leaving some servers without any associated agents in normal operation.

Moving to an HA installation

After deciding that an RHQ High Availability strategy is appropriate for your needs, you should do two things to prepare for your installation or upgrade:

Ensure your database is ready to support the load and connections. See Database Impact.
Determine your affinity strategy. See Affinity.

Note that affinity assignments can be added or removed at any time, but it is useful to consider your initial approach, even if it confirms that affinity assignments are unnecessary. From an installation and upgrade perspective, an HA environment does not require different steps, and you can move to it incrementally.

Server Requirements For HA

Each RHQ Server in an HA Server Cloud must:

be running a compatible version of RHQ. Unless specifically noted, this means the exact same version of RHQ.
be uniquely named. The server names are defined during RHQ Server installation.
define a unique endpoint that is resolvable and reachable by all RHQ Agents running against the HA Server Cloud. This address/port or address/secure port combination is defined during installation. This requirement exists because any given RHQ Agent may be talking to any RHQ Server at a given time. It is not the case that the RHQ Server defined in the RHQ Agent's initial configuration (from the agent setup questions, for example) will be the RHQ Server that the agent is told to connect to later.

Note that the first RHQ Server installed will become the initial member in the HA Server Cloud. This means that a single-server installation can also be thought of as a 1-Server HA Cloud and therefore has the same Server requirements. The RHQ GUI HA Administration Console pages are still usable to inspect or manage your environment.

Perhaps, more importantly, this allows the 2nd, 3rd...Nth RHQ Servers to be added at any time, even while other RHQ Servers are running. Conversely, RHQ Servers can be removed from the Cloud at any time.

RHQ Servers communicate solely via the database and therefore it is not required that their endpoints be visible to each other. No direct server-to-server communication is ever made.

Agent Requirements for HA

Each RHQ Agent in an HA environment must:

be running a version compatible with the RHQ Servers. Unless specifically noted, this means the exact same version as that required by all RHQ Servers.
be configured initially to contact any RHQ Server in the HA cloud.
define an endpoint address that is resolvable by, and connectable to, any and all RHQ Servers in the HA Server Cloud. This is because any given RHQ Agent may be connected to any given RHQ Server at a given time. Note that the RHQ Server defined in the RHQ Agent's initial configuration (from the setup questions, for example) may not be the RHQ Server that the agent will talk to in the future. That initial RHQ Server the agent is configured for will be used to initially register the agent, but it may not be the primary server the agent is assigned.

To install or upgrade an RHQ Agent, there is nothing different to do beyond the normal agent install/upgrade steps. Since HA environments typically involve many agents, it may be useful to pre-configure your Agents to avoid having to answer the initial setup questions interactively.

Managing an HA installation

Even a 1-Server installation can take advantage of certain HA management capabilities. But after adding a second server, or more, it will be useful to become comfortable with the HA management features available in RHQ. In general, the steps to take when building up and managing an HA Server Cloud are:

Familiarize yourself with the HA Administration Console (HAAC) pages. You will use these administration pages in the RHQ GUI to manage and monitor your HA environment. These pages are available even when running a single server so it should be easy to get comfortable with their use.
Add your desired RHQ Server(s). Increasing the size of your HA Server Cloud is done by running the installer on the desired Server node(s). To do this, follow the RHQ Server installation instructions. Use the Administration>Servers page to examine your installed RHQ Servers.
Add your desired RHQ Agents. To do this, follow the RHQ Agent installation instructions. Use the Administration>Agents page to examine your registered RHQ Agents.
Update your Server Affinity Group membership. If you have decided to use Affinity Groups to enable Agent-Server connection preferences, then you can use the Administration>Affinity Groups page to create your desired Affinity Groups and associate your RHQ Servers as desired.
Update your Agent Affinity Group membership. If you have decided to use Affinity Groups to enable Agent-Server connection preferences, then you can use the Administration>Affinity Groups page to associate your RHQ Agents as desired.
Inspect your topology. It may take some time, typically less than an hour, after making these changes before your Agents reach a steady state, each connected to its expected primary RHQ Server. You can use the HAAC pages to inspect the topology and observe the Agents incrementally migrate.
Optionally, test failover. When you are satisfied that your HA Server Cloud is servicing Agents as expected, you may want to test agent failover. To do this, you can set Servers to Maintenance Mode via the Administration>Servers page. Maintenance Mode essentially takes the RHQ Server offline so it stops accepting messages from agents (to an agent, it looks like that RHQ Server has gone offline).

JBoss Community Archive (Read Only)

RHQ 4.10
